{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2b50bdff",
   "metadata": {},
   "source": [
    "## Convert PDF to Text\n",
    "This notebooks shows you how you can extract text from notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f5fb043",
   "metadata": {},
   "source": [
    "### Installing pre-required libs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 152,
   "id": "8dc02932",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install PdfReader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f5936ee0",
   "metadata": {},
   "outputs": [],
   "source": [
    "### Importing pre-required libs\n",
    "import PyPDF2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d35b69f0",
   "metadata": {},
   "source": [
    "<span style=\"color:blueviolet\">Step 1. By using below method you can get data from pdf</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "5e4f1f7f",
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_text_from_pdf(pdf_file_path):\n",
    "    with open(pdf_file_path, 'rb') as file:\n",
    "        reader = PyPDF2.PdfReader(file)\n",
    "        num_pages = len(reader.pages)\n",
    "\n",
    "        combined_text = \"\"\n",
    "        for page_number in range(num_pages):\n",
    "            page = reader.pages[page_number]\n",
    "            text = page.extract_text()\n",
    "            combined_text += text\n",
    "\n",
    "        return combined_text\n",
    "\n",
    "# Usage example\n",
    "pdf_path = '1_55661.pdf'\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cd4814e",
   "metadata": {},
   "source": [
    "<span style=\"color:blueviolet\">Step 2. By using below method you will get separate paragraphs</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "id": "583bc97d",
   "metadata": {},
   "outputs": [],
   "source": [
    "def separate_paragraphs(text):\n",
    "    paragraphs = text.split(\"  \\n\")  # Split text by double line breaks\n",
    "\n",
    "    # Remove leading and trailing whitespaces from each paragraph\n",
    "    paragraphs = [paragraph.strip() for paragraph in paragraphs]\n",
    "\n",
    "    return paragraphs\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8678e5f2",
   "metadata": {},
   "source": [
    "<span style=\"color:blueviolet\">Step 3. Counting Total Paragraphs from pdf</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 137,
   "id": "93f6fa91",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "283"
      ]
     },
     "execution_count": 137,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "full_doc = extract_text_from_pdf(pdf_path)\n",
    "full_paras = separate_paragraphs(full_doc)\n",
    "len(full_paras)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "id": "1908f943",
   "metadata": {},
   "outputs": [],
   "source": [
    "question = \"what are the supported platform for ssh runner?\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1fa3bcef",
   "metadata": {},
   "source": [
    "### Retrieve the Context from Text "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 187,
   "id": "20584c80",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Most Relevant context: First, download the gitlab-runner distribution for the appropriate platform at \n",
      "https://docs.gitlab.com/runner/install/ . In this configura tion, the SSH GitLab runner is installed on a Linux \n",
      "x86 Red Hat distribution, but similar installation and configuration can be performed on other supported \n",
      "platforms.\n",
      "\n",
      "The SSH runners are executed on a supported platform (Windows, Linux or MacOS) and connect to a \n",
      "target machine through SSH for the pipeline execution. In that configuration, SSH runners act like \n",
      "gateways to connect platforms: the GitLab server will send the pipeline actions to the SSH runner which \n",
      "will forward them to the target machine, in this case th e z/OS environment. The different stages of the \n",
      "pipeline will be executed on the target z/OS machine and results will be sent back to the GitLab server \n",
      "through the same mechanism.\n",
      "\n",
      "Provide the GitLab server URL and the registration token to register the runner. Then provide a description \n",
      "for this runner and leave empty when prompted for tags  (tags can be modified later if necessary) . Provide \n",
      "the type of executor by typing ssh and provide the nec essary information for SSH communication (IP or \n",
      "hostname of the target z/OS machine, port, username and password or path to the SSH identity file). \n",
      "When finished, start the runner by issuing the sudo g itlab-runner start  command.\n",
      "\n",
      "Similar pipeline definitions are configured for the EPSM project and for the zAppBuild project.\n",
      "\n",
      "In Linux, install the downloaded package with the command rpm -i gitlab-runner_amd64.rpm . \n",
      "Then register the runner to the GitLab server, by issuing the sudo  gitlab-runner register  command. The \n",
      "configurator prompts for several pieces of information that are displayed in the GitLab server \n",
      "Administration area (Runners section):\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "1789"
      ]
     },
     "execution_count": 187,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "\n",
    "def find_most_similar_text(sentence, text_objects):\n",
    "    # Create a TF-IDF vectorizer\n",
    "    vectorizer = TfidfVectorizer()\n",
    "\n",
    "    # Fit the vectorizer on text_objects\n",
    "    tfidf_matrix = vectorizer.fit_transform(text_objects)\n",
    "\n",
    "    # Transform the input sentence using the fitted vectorizer\n",
    "    sentence_vector = vectorizer.transform([sentence])\n",
    "\n",
    "    # Calculate cosine similarities between the sentence vector and all text objects\n",
    "    similarities = cosine_similarity(sentence_vector, tfidf_matrix)\n",
    "\n",
    "    # Find the index of the most similar text object\n",
    "    most_similar_index = similarities.argmax()\n",
    "\n",
    "    # Get the most similar text object\n",
    "    most_similar_text = text_objects[most_similar_index]\n",
    "\n",
    "    # Check the length of the most similar text object\n",
    "    if len(most_similar_text) < 2000:\n",
    "        # Find the indices and similarities of the related text objects\n",
    "        sorted_indices = similarities.argsort()[0][::-1]\n",
    "        sorted_similarities = similarities[0, sorted_indices]\n",
    "\n",
    "        # Iterate over the related text objects and add paragraphs until length exceeds 2000 characters\n",
    "        for idx, similarity in zip(sorted_indices, sorted_similarities):\n",
    "            if idx != most_similar_index:\n",
    "                related_text = text_objects[idx]\n",
    "                if len(most_similar_text) + len(related_text) + 2 > 2000:  # +2 for \"\\n\\n\"\n",
    "                    break\n",
    "                most_similar_text += \"\\n\\n\" + related_text\n",
    "\n",
    "    return most_similar_text\n",
    "question = \"what are the supported platform for ssh runner?\"\n",
    "\n",
    "most_relevant_text = find_most_similar_text(question, ful_paras)\n",
    "print(\"Most Relevant context:\", most_relevant_text)\n",
    "len(most_relevant_text)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
